Contract.Information Contract.Content
Project ID XXXXXXXXXX
Project Name Kinnex Full-Length RNA-seq with Reference Genome Report (PacBio) (WBI)
Report ID XXXXXXXXXX-XX-XXXX
Report Time 20xx-xx-xx

1 Background


1.1 Introduction


  Although next-generation sequencing based RNA Sequencing (RNA-seq) has become an important approach to study gene expression profiles and characterize gene structure, there are limitations for this technique to depict reliable landscape of transcript isofrms. Short reads cannot span to full-length transcripts unless assembled by an assembly software. Moreover, short reads assembly is subject to errors which make them less accurate in differentiating alternative-splicing transcripts.The SMRT (Single Molecule Real-Time) sequencing technology from Pacific Biosciences (PacBio) is able to sequence the full-length transcript in real time, from 5' end of transcripts to the poly(A) tail — eliminating the need for transcriptome reconstruction by a bioinformatic analysis software based on isoform-inference algorithms. The full-length Isoform Sequencing (Iso-Seq) powered by PacBio SMRT sequencing technology enables accurate detection of isoforms structure.

1.2 Iso-Seq Advantages


  PacBio SMRT sequencing can be applied to transcriptome study (Iso-Seq), for its following advantages:

  1. Full-length transcript sequences can be obtained directly without the need for transcript assembly.

  2. Iso-Seq helps to discover more novel genes, novel isoforms, fusion-trasncripts.

  3. Iso-Seq provides more accurate information about homologous gene, allele expression, open reading frame prediction, alternative polyadenylation (APA), and superfamily genes.

2 Library and Sequencing


2.1 Sample Quality Control


  Sample quality control methods are mentioned in the sample QC report.

2.2 Library Construction, Quality Control and Sequencing


  Synthesize cDNA strands using the Iso-Seq Express 2.0 kit, and then prepare the library using the Kinnex Full-Length RNA kit. The library preparation steps are as follows:

Figure 2.2 Library construction workflow

  The library’s concentration was quantified with Qubit and its insert size distribution was detected with the Agilent 2100 Bioanalyzer. Quantified libraries were pooled and sequenced on PacBio platform, according to effective library concentration and data amount required.

3 Bioinfomatics Workflow


3.1 Analysis Pipeline


  The two key components of PacBio SMRT Iso-Seq sequencing are the structural analysis of transcripts and the analysis of differentially expressed genes, with the latter being conducted when NGS (Next Generation Sequencing) data is available. The bioinformatics analysis pipeline is illustrated as follows.

Figure 3.1 Bioinformatics workflow

4 Result Display and Explanation


4.1 Analysis of Full-Length Transcripts


  On the Revio platform, the sequences are directly obtained as HiFi reads. These HiFi reads are then filtered and processed using the software SMRTlink as follows:

1. Identify full-length non-chimera (FLNC) and non-full-length (nFL) sequences by classifying the circular consensus sequences (CCS) based on the presence of 5'-primer, 3'-primer, and poly-A tail.
2. Cluster FLNC sequences from the same transcript using the hierarchical n*log(n) algorithm to obtain consensus sequences.

4.1.1 HiFi Statistics

Table 4.1.1 HiFi summary

  • Sample:sample name
  • HiFi_number:the number of HiFi reads.
  • Min_length (bp):the length of shortest HiFi read.
  • Max_length(bp):the length of longest HiFi read.
  • Mean_length(bp):the average length of HiFi read.
  • N50(bp): the read length at which 50% of the bases are in reads longer than, or equal to, this value.

  The length distribution of HiFi is depicted in the following histogram:

Figure 4.1.1 The length distribution of HiFi reads

4.1.2 FLNC Statistics

  In the SMRT analysis software, a read with both 3' and 5' primer sequences, as well as a poly(A) tail before the 3' primer sequence, is defined as full-length sequences (FL reads), while a read which lacks any of these three sequences is called non-full-length read. The non-chimeric sequence in a FL read is referred as Full-Length Non-Chimeric Read (FLNC).

Table 4.1.2 The statistical outcomes for FLNC

Note: We have set Copy/Print/Download and Search functions for this form, click Copy to copy the data in the form, users can create a new excel or text document arbitrarily, paste the data into it, and retain the original format of rows and columns ;Click Print to print the data in the table (if the printer is connected); click Download to download the data in the table in three formats: CSV, Excel, and PDF. Enter the specified information in the Search box, and all the content containing the information will be displayed in the table below. The Copy/Print/Search functions involved in the table below are the same as here.

  • Sample:sample name
  • FLNC_number:the number of FLNC reads.
  • Min_length(bp):the length of shortest FLNC.
  • Max_length(bp):the length of longest FLNC.
  • Mean_length(bp):the average length of FLNC.
  • N50(bp):the read length at which 50% of the bases are in reads longer than, or equal to, this value.

  The length distribution of FLNC is depicted in the following histogram:

Figure 4.1.2 The length distribution of FLNC

4.1.3 Consensus Reads Statistics

  The full-length transcriptome employs a hierarchical n*log(n) algorithm to cluster and eliminate redundancy of FLNC sequences of the same transcript, obtaining consensus sequences. Subsequently, the Arrow software corrects the obtained consensus sequences, resulting in polished consensus sequences for downstream analysis.

Table 4.1.6 Statistics of polished consensus

Note: We have set Copy/Print/Download and Search functions for this form, click Copy to copy the data in the form, users can create a new excel or text document arbitrarily, paste the data into it, and retain the original format of rows and columns ;Click Print to print the data in the table (if the printer is connected); click Download to download the data in the table in three formats: CSV, Excel, and PDF. Enter the specified information in the Search box, and all the content containing the information will be displayed in the table below. The Copy/Print/Search functions involved in the table below are the same as here.

  • Sample:sample name
  • Consensus_number:the number of consensus reads.
  • Min_length(bp):the shortest polished consensus length.
  • Max_length(bp):the longest polished consensus length.
  • Mean_length(bp):the average length of polished consensus.
  • N50(bp):the read length at which 50% of the bases are in reads longer than, or equal to, this value.

  The length distribution of polished consensus reads is depicted in the following histogram:

Figure 4.1.6 The length distribution of polished consensus

Note: The x-axis is the read length of polished consensus reads, and the y-axis is the number of polished consensus reads.

  Result file directory: XXX_Result_XXXX/01.SMRT

4.2 Analysis of Alignment with the Reference Genome


4.2.1 Summary of the Alignment Results

  The polished consensus sequences are aligned to the reference genome using the Genomic Mapping and Alignment Program (GMAP). Based on the alignment results, the sequences are divided into five categories: Unmapped, Multiple mapped, Uniquely mapped, Reads map to '+', and Reads map to '-'. The consensus sequence and the reference genome alignment results is presented in the table below:

Table 4.2.1 Mapping Summary

  • Sample: sample name
  • Total reads: the total reads number of consensus sequences.
  • Total mapped: the statistics of consensus sequences aligned to the reference genome.
  • Unmapped: the statistics of consensus sequences that are unmapped to the reference genome.
  • Multiple mapped: the statistics of consensus sequences aligning to multiple positions on the reference genome.
  • Uniquely mapped: the statistics of consensus sequences aligning uniquely to the reference genome.
  • Reads map to '+': the statistics of consensus sequences map to forward strand.
  • Reads map to '-': the statistics of consensus sequences map to reverse strand.

Figure 4.2.1.1 Visualization of the mapping outcomes

  Based on the alignment results, the distribution of coverage and identity for each transcript is concluded in the figure below.

Figure 4.2.1.2 Illustrates the distribution of coverage and identity

Note: The x-axis represents the range of variation in coverage and identity, while the y-axis denotes the percentage of transcripts. The term "identity" indicates the consistency of the sequence. This value does not have a fixed range, with a higher numerical value indicating a higher degree of sequence homology. The term "coverage" represents the ratio of the length of the fully covered transcriptome to the size of the reference gene. A higher numerical value signifies a better quality of sequencing.

  Based on the comparative analysis, a statistical assessment of the distinct categories of junction sites is performed. The results are shown in the figure below:

Figure 4.2.1.3 Statistical results of junction sites

Note: "GT-AG/GC-AG" represents a typical GT-AG/GC-AG shared sequence at each intron exon splice junction.

4.2.2 The Saturation Curve Analysis

  The saturation curve provides a visual representation of the relationship between the total transcriptome and the number of covered genes. Conduct saturation curve analysis of the gene count mapped by FLNC reads at different depths to observe whether the overall gene count reaches saturation levels. Evaluate whether the sequencing data is sufficient for subsequent analysis. The analysis results are shown in the following figure:

Figure 4.2.2 Saturation curve

Note: The x-axis denotes the number of sequenced FLNC reads, while the y-axis represents the number of genes covered by FLNC reads.

4.2.3 The Density of Full-Length Transcripts

  The density of total mapped reads aligned to the reference genome (both forward and reverse strands) is statistically analyzed.

Figure 4.2.3.1 Distribution of full-length transcripts on the reference genome

Note: The drawing method involves using a sliding window size of 5K to calculate the median of reads aligned to the base position within the window, and convert it to log2. The x-axis represents the length of the chromosome (in millions of base pairs), while the y-axis represents the log2 median reads density. The green color indicates the forward strand, and the red color indicates the reverse strand. (By default, 15 chromosomes are selected for display, and if there are fewer than 15 chromosomes, all of the chromosomes will be displayed.)

  Under normal circumstances, the longer the length of a chromosome, the greater the total number of reads localized within that chromosome. The relationship between the number of reads localized on the chromosome and the chromosome length is more visually evident from the graph. (A maximum of 15 chromosomes are displayed in the figure.)

Figure 4.2.3.2 The relationship between the number of reads and the length of the chromosome

Note: The x-axis represents the length of the chromosome in megabases (Mb), and the y-axis represents the number of reads mapped to the chromosome (Mb,The gray area in the figure represents a 95% confidence interval).

4.2.4 Functional Annotation

  In order to obtain comprehensive annotation information, the transcripts (both the mapped and unmapped transcripts ) are functionally annotated in seven databases (NR, NT, Pfam, kog/cog, Swissprot, KEGG, GO). To facilitate the understanding of the annotation status in each database, the number of successfully annotated transcripts in the seven major databases is statistically analyzed.

Figure 4.2.4 The annotated results from seven major databases

  • NR: the number of annotated transcripts in the NR database.
  • SwissProt: the number of annotated transcripts in the SwissProt database.
  • KEGG: the number of annotated transcripts in the KEGG database.
  • KOG: the number of annotated transcripts in the KOG/COG database.
  • GO: the number of annotated transcripts in the GO database.
  • NT: the number of annotated transcripts in the NT database.
  • Pfam: the number of annotated transcripts in the Pfam database.
  • All in Databases: the total number of annotated transcripts across all databases.
  • One in Databases: at least one database annotation has successfully annotated the number of transcripts.
  • Total Annotation: the number of all annotated transcripts.

4.2.4.1 NR Annotation

  The Non-Redundant Protein Database (NR), established and maintained by the National Center for Biotechnology Information (NCBI), represents a comprehensive repository of non-redundant protein sequences. Its annotations encompass species-specific information, facilitating taxonomical classification. Leveraging NR database annotations affords access to functional insights into the genes of the respective species.The annotation results are classified as follows:

Figure 4.2.4.1 Annotation of NR database

Note: The x-axis represents the species ID, while the y-axis represents the number of annotated genes.

4.3.4.2 GO Annotation

  GO (Gene Ontology) is the established standard for gene function annotation. The GO provides controlled vocabularies to classify different gene functions: Biological Process (BP), Molecular Function (MF) and Cellular Component (CC). GO term is the basic unit of GO system. Each term has a unique identifier. The relationship between the GO term of each ontology can form a Directed Acyclic Topology. More details could be found from the website: Gene Ontology Resource. Novel genes and novel isoforms detected by Transcriptome analysis are annotation with GO terms and the annotation results are classified as follows:

Figure 4.2.4.2 GO annotation classification

Note: The x-axis represents the sub-terms of GO terms for the three main categories of GO, and the y-axis represents the number of genes annotated to this term (including the sub-terms of the term). The three different classifications represent the three fundamental classifications of GO terms (from left to right: biological process, cellular component, molecular function).

4.3.4.3 KOG/COG Annotation

  KOG/COG, short for Cluster of Orthologous Groups of proteins, are divided into two categories: one is the prokaryotic organism database, commonly referred to as the COG database; the other is the eukaryotic organism database, generally known as the KOG database. This database is created and is maintained by NCBI, and is constructed based on the evolutionary relationships of encoded protein systems in complete genomes of bacteria, algae, and eukaryotes. By aligning a protein sequence to a specific KOG/COG, which is composed of orthologous sequences, one can infer the function of that sequence. The KOG/COG database can be classified into a total of twenty-six functional categories, and the annotation results of the KOG/COG database are as follows:

Figure 4.2.4.3 Annotation of KOG/COG database

Note: The x-axis denotes the names of the twenty-six groups of KOG/COG, while the y-axis represents the proportion of genes annotated to each group compared to the total number of annotated genes.

4.3.4.4 KEGG Annotation

  The full name of KEGG is the Kyoto Encyclopedia of Genes and Genomes. It is a database that performs systematic analysis of the metabolic pathways of gene products and compounds within cells, as well as the functions of these gene products. KEGG integrates data from various aspects such as genomics, chemical molecules, and biochemical systems, including metabolic pathways (KEGG PATHWAY), drugs (KEGG DRUG), diseases (KEGG DISEASE), functional models (KEGG MODULE), gene sequences (KEGG GENES), and genomes (KEGG GENOME), among others. The KO (KEGG ORTHOLOG) system links various KEGG annotation systems together, and KEGG has established a comprehensive KO annotation system, which can complete functional annotation of the genome or transcriptome of new species. After annotating genes with KO, they can be classified according to the KEGG metabolic pathways in which they participate.

Figure 4.2.4.4 KEGG Metabolic pathway classification

Note: The numbers on the bar graph represent the number of annotated genes. The y-axis represents the codes of level 1 functional categories, with their explanations detailed in the corresponding legend.

4.2.4.5 Pfam Annotation

  Pfam is a widely utilized protein family database, encompassing over 13,000 confirmed protein families. This database comprises two entities: a high-quality, confirmed Pfam-A database, and a supplementary Pfam-B database that is automatically annotated using the ADDA algorithm. Annotation results are obtained by simultaneously annotating data in both the Pfam-A and Pfam-B databases. The detailed annotation results can be found in the result file.

  Result file catalog:   *_Result_XXXX/04.Structure/04.Annotation/{Unmap_anno|Iso_anno}/Sample/Pfam/

4.2.4.6 SwissProt Annotation

  It's a high quality protein sequence database, which brings together experimental results, computed features and scientific conclusions. More details could be found from the following website:Uniprot EMBL-EBI. Diamond is used to annotate novel genes and novel isoforms to Swissprot Database, the detailed annotation results can be found in the result file.

  Result file catalog:
  *_Result_XXXX/04.Structure/04.Annotation/{Unmap_anno|Iso_anno}/Sample/SwissProt/

4.2.4.7 NT Annotation

  The official nucleotide sequence database of the National Center for Biotechnology Information (NCBI) encompasses nucleotide sequences from GenBank, EMBL, and DDBJ. The data has been aligned and annotated against the database, and the detailed annotation results can be found in the result file.

  Result file catalog:
  *_Result_XXXX/04.Structure/04.Annotation/{Unmap_anno|Iso_anno}/Sample/NT/

4.3 Structure Analysis


4.3.1 Transcriptome Classification and Feature Analysis

4.3.1.1 The Classification of Full-Length Transcripts

  Pigeon is a PacBio Transcript Toolkit designed to classify and filter full-length transcript isoforms into specific categories based on a reference annotation. Based on SQANTI3, Pigeon produces output that is compatible with downstream analysis in Seurat. SQANTI3 is the initial module of the Functional IsoTranscriptomics (FIT) pipeline, serving as a comprehensive approach for conducting isoform-level bioinformatics analyses. This tool is specifically developed to facilitate quality control and filtering of transcriptomes derived from long reads, which frequently contain numerous artifacts and false-positive isoforms. Consequently, meticulous curation of the transcriptome is essential for carrying out FIT analysis and deriving valid, biologically meaningful conclusions or hypotheses. The classification of transcript structures is illustrated in the diagram below:

Figure 4.3.1.1 Transcript structure classification by SQANTI3 software

  The transcript results are shown in the following table:

Table 4.3.1.1 Full-length transcripts classification results

  • Sample: sample name.
  • All_Isoforms: number of transcripts.
  • FSM (Full Splice Match): the reference and query isoform have the same number of exons and each internal junction matches the positions of the reference. The exact 5’ start and 3’ end can differ within the first/last exons.
  • ISM (Incomplete Splice Match): the query isoform has fewer external exons than the reference, but each internal junction matches the positions of the reference. The exact 5’ start and 3’ end can differ within the first/last exons.
  • NIC (Novel In Catalog): the query isoform does not have a FSM or ISM match, but is using a combination of known donor/acceptor sites.
  • NNC (Novel Not in Catalog): the query isoform does not have a FSM or ISM match, and has at least one donor or acceptor site that is not annotated.
  • Antisense: the query isoform does not have overlap a same-strand reference gene but is anti-sense to an annotated gene.
  • Genic Intron: the query isoform is completely contained within a reference intron.
  • Genic Genomic: the query isoform overlaps with introns and exons.
  • Intergenic: the query isoform is in the intergenic region.

  The pie chart below illustrates the categorization of full-length transcripts.

Figure 4.3.1.2 Pie chart depicting the classification of transcripts

4.4.1.2 Transcriptome Characteristic Analysis

  The full-length transcriptome leverages the advantages of PacBio platform, yielding transcript sequences of greater length compared to annotated transcripts from the reference genome. The length density distribution plot of known transcripts from the reference genome and transcripts detected by PacBio Iso-Seq is depicted in the figure below:

Figure 4.4.1.2.1 The density distribution of transcript length

Note: The x-axis represents the length of transcripts, while the y-axis represents the density of transcripts.

  The full-length transcriptome exhibits a more comprehensive structural composition, revealing a greater abundance of novel transcripts. This supplementation enriches the annotation of genes. A comparative analysis is conducted between known transcripts of the reference genome, and the transcript counts detected by PacBio Iso-Seq is depicted in the figure.

Figure 4.4.1.2.2 The distribution of the number of transcripts corresponding to each gene

Note: The x-axis represents the number of transcripts, while the y-axis represents the proportion of transcripts corresponding to genes.

  Full-length transcriptome sequencing can reveal the true structural features of transcripts and identify a more comprehensive number of exons compared to the reference genome. A comparison between the known transcripts in the reference genome and those detected by PacBio Iso-Seq reveals the exon quantity, as depicted in the figure below:

Figure 4.4.1.2.3 The distribution of the number of exons corresponding to each transcript

Note: The x-axis represents the number of exons, while the y-axis represents the proportion of transcripts corresponding to the number of exons.

4.4.2 Alternative Splicing

  The SUPPA software is designed for analyzing alternative splicing in RNA-Seq data. The summary of alternative splicing analysis results is shown below.

Table 4.4.2.1 Alternative splicing summary

  • Sample name: sample name
  • Total genes: the total number of genes annotated in the full-length transcriptome.
  • SE: the proportion of genes involved in SE relative to the total annotated gene count.
  • MX: the proportion of genes involved in MX relative to the total annotated gene count.
  • RI: the proportion of genes involved in RI relative to the total annotated gene count.
  • A5: the proportion of genes involved in A5 relative to the total annotated gene count.
  • A3: the proportion of genes involved in A3 relative to the total annotated gene count.
  • AF: the proportion of genes involved in AF relative to the total annotated gene count.
  • AL: the proportion of genes involved in AL relative to the total annotated gene count.

Figure 4.4.2 Overview of alternative splicing

  For each alternative splicing type, the occurrences of alternative splicing sites and the transcripts generated through alternative splicing are tabulated. The summarized results are as follows.

Table 4.4.2.2 Variable splicing event outcomes

  • Chr_id: the chromosome ID to which the gene is annotated.
  • Gene_id: the gene ID where this alternative splicing occurs.
  • Strand: the information on the gene's for forward or reverse strand, "+" reprensents forward strand and "-" represents reverse strand.
  • Type: the type of alternative splicing.
  • Alt_exon_pos: the start and end points of the exons involved in the alternative splicing (start-end).
  • Alt_trans: the transcript in which this alternative splicing occurs.
  • Total_trans: all transcript IDs supporting this alternative splicing event.

4.4.3 Alternative PolyAdenylation

  In eukaryotes, polyadenylation is part of the process that produces mature mRNA after translation, which forms a poly(A) tail consisting of multiple adenosine monophosphates to the 3' end of a transcript. Many protein-coding genes have several polyadenylation sites which enables a gene to code more than one kinds of transcripts with different 3' end sequences. This phenomenon is called alternative polyadenylation (APA). On the other hand, the length of the 3' untranslated region (3'UTR) is influenced by polyadenylation, which usually contains binding sites for microRNAs. The poly(A) tail is important for the nuclear export, translation, and stability of mRNA. TAPIS is used to uncover the Alternative polyadnenylation sites for the isoforms.

Table 4.4.3.1 Alternative polyAdenylation summary

  • GeneID: gene ID.
  • Strand: the information on the gene's for forward or reverse strand. "+" represents forward strand and "-" represents reverse strand.
  • Aligned reads: reads number.
  • Num sites: the APA site count.
  • Locations: the APA site.
  • Distance: the maximum distance between APA loci.
  • Each_site_of_reads: the read count for each APA site.

Figure 4.4.3.1 Distribution of polyadenylation sites

Note: The x-axis represents the number of variable polyadenylation sites, while the y-axis represents the number of genes.

  Based on the statistical results of poly(A), select the main poly(A) site information. The selection criteria are: (1) the gene has at least ten FLNC transcripts; (2) the poly(A) site has at least 5 FLNC transcripts; (3) the difference in usage rates between the top-ranking poly(A) site and the second-ranking site is at least 10%. The statistical results are presented below:

Table 4.4.3.2 Statistics of the primary polyadenylation site utilization

  • GeneID: gene ID
  • Strand: strand.
  • Aligned_reads: read count.
  • Num_sites: the APA site count.
  • Usage_of_first: the utilization rate of the primary polyadenylation site.
  • Usage_of_second: the second polyadenylation site utilization rate.
  • Diff_value(%): the disparity between the utilization of the first and the utilization of the second.

Figure 4.4.3.2 Bos plot of the predominant polyadenylation site utilization

Note: The x-axis represents the quantity of poly(A) sites for each gene and the y-axis indicates the predominant poly(A) utilization rate.

4.4.4 Analysis of Novel Genes and Transcripts

New gene prediction:Based on the mapping results of transcripts to the reference genome, reads that align to unannotated regions in the GTF (Gene Transfer Format) file of the reference genome are defined as novel genes.

Table 4.4.4.1 The result of predicted novel genes

  • GeneID: gene ID.
  • Chromosome: chromosome ID.
  • Start: transcription start site.
  • End: transcription end site.
  • Strand: positive and negative strands.
  • Isoforms: Number of transcripts.

New Transcript Prediction:Based on the mapping results of transcripts to the reference genome, transcripts with structures differing from those in the reference genome GTF file are defined as novel transcripts.

Table 4.4.4.2 Result of novel transcript predictions

  • GeneID: gene ID.
  • IsoformID: transcript ID.
  • Chromosome: chromosome ID.
  • Start: start site.
  • End: end site.
  • Strand: positive and negative strands.
  • Exon_number: the number of exons.

4.4.5 Novel Gene Functional Annotation

  In order to investigate the functional information of novel genes, functional annotation of the novel genes is performed in seven major databases (NR, NT, Pfam, KOG/COG, SwissProt, KEGG, GO). The statistical results are presented in the figure below.

Figure 4.4.5 The annotated results from seven major databases

  • NR: the number of annotated transcripts in the NR database.
  • SwissProt: the number of annotated transcripts in the SwissProt database.
  • KEGG: the number of annotated transcripts in the KEGG database.
  • KOG: the number of annotated transcripts in the KOG/COG database.
  • GO:: the number of annotated transcripts in the GO database.
  • NT: the number of annotated transcripts in the NT database.
  • Pfam: the number of annotated transcripts in the Pfam database.
  • All in Databases: the total number of annotated transcripts across all databases.
  • One in Databases: at least one database annotation has successfully annotated the number of transcripts.
  • Total Annotation: the number of all transcripts.

4.4.5.1 NR Annotation

  The Non-Redundant Protein Database (NR), established and maintained by the National Center for Biotechnology Information (NCBI), represents a comprehensive repository of non-redundant protein sequences. its annotations encompass species-specific information, facilitating taxonomical classification. Leveraging NR database annotations affords access to functional insights into the genes of the respective species.The annotation results are classified as follows:

Figure 4.4.5.1 Annotation of NR database

Note: the x-axis represents the species ID, while the y-axis represents the number of annotated genes.

4.4.5.2 GO Annotation

  GO (Gene Ontology) is the established standard for gene function annotation. The GO provides controlled vocabularies to classify different gene functions: Biological Process (BP), Molecular Function (MF) and Cellular Component (CC). GO term is the basic unit of GO system. Each term has a unique identifier. The relationship between the GO term of each ontology can form a Directed Acyclic Topology. More details could be found from the website: Gene Ontology Resource. Novel genes and novel isoforms detected by Transcriptome analysis are annotation with GO terms and the annotation results are classified as follows:

Figure 4.4.5.2 GO annotation classification

Note: The x-axis represents the next term of GO terms for the three main categories of GO, and the y-axis represents the number of genes annotated to this term (including the sub-terms of the term). The three different classifications represent the three fundamental classifications of GO terms (from left to right: biological process, cellular component, molecular function).

4.4.5.3 KOG/COG Annotation

  KOG/COG, short for Cluster of Orthologous Groups of proteins, are divided into two categories: one is the prokaryotic organism database, commonly referred to as the COG database; the other is the eukaryotic organism database, generally known as the KOG database. This database is created and is maintained by NCBI, and is constructed based on the evolutionary relationships of encoded protein systems in complete genomes of bacteria, algae, and eukaryotes. By aligning a protein sequence to a specific KOG/COG, which is composed of orthologous sequences, one can infer the function of that sequence. The KOG/COG database can be classified into a total of twenty-six functional categories, and the annotation results of the KOG/COG database are as follows:

Figure 4.4.5.3 Annotation of KOG/COG database

Note: The x-axis denotes the names of the twenty-six groups of KOG/COG, while the y-axis represents the proportion of genes annotated to each group compared to the total number of annotated genes.

4.4.5.4 KEGG Annotation

  The full name of KEGG is the Kyoto Encyclopedia of Genes and Genomes. It is a database that performs systematic analysis of the metabolic pathways of gene products and compounds within cells, as well as the functions of these gene products. KEGG integrates data from various aspects such as genomics, chemical molecules, and biochemical systems, including metabolic pathways (KEGG PATHWAY), drugs (KEGG DRUG), diseases (KEGG DISEASE), functional models (KEGG MODULE), gene sequences (KEGG GENES), and genomes (KEGG GENOME), among others. The KO (KEGG ORTHOLOG) system links various KEGG annotation systems together, and KEGG has established a comprehensive KO annotation system, which can complete functional annotation of the genome or transcriptome of new species. After annotating genes with KO, they can be classified according to the KEGG metabolic pathways in which they participate.

Figure 4.4.5.4 KEGG Metabolic pathway classification

Note: The numbers on the bar graph represent the number of annotated genes. The y-axis represents the codes of level 1 functional categories, with their explanations detailed in the corresponding legend.

4.4.5.5 Pfam Annotation

  Pfam is a widely utilized protein family database, encompassing over 13,000 confirmed protein families. This database comprises two entities: a high-quality, confirmed Pfam-A database, and a supplementary Pfam-B database that is automatically annotated using the ADDA algorithm. Annotation results are obtained by simultaneously annotating data in both the Pfam-A and Pfam-B databases. The detailed annotation results can be found in the result file.

  Result file directory:
  *_Result_XXXX/04.Structure/04.Annotation/Novel_anno/Sample/Pfam/

4.4.5.6 SwissProt Annotation

  It's a high quality protein sequence database, which brings together experimental results, computed features and scientific conclusions. More details could be found from the following website:Uniprot EMBL-EBI. Diamond is used to annotate novel genes and novel isoforms to Swissprot Database, the detailed annotation results can be found in the result file.

  Result file catalog:
  *_Result_XXXX/04.Structure/04.Annotation/{Unmap_anno|Iso_anno}/Sample/SwissProt/

4.4.5.7 NT Annotation

  The official nucleotide sequence database of the National Center for Biotechnology Information (NCBI) encompasses nucleotide sequences from GenBank, EMBL, and DDBJ. The data has been aligned and annotated against the database, and the detailed annotation results can be found in the result file.

  Result file directory:
  *_Result_XXXX/04.Structure/04.Annotation/Novel_anno/Sample/NT/

4.4.6 Transcription Factor Analysis

   Transcription factor (TF) (or sequence-specific DNA-binding factor) is a protein that controls the rate of transcription of genetic information from DNA to messenger RNA, by binding to a specific DNA sequence.Use iTAK software for plant transcription factor prediction during analysis.

Table 4.4.6 The statistical results of plant transcription factors.

  • Gene id: gene id
  • Family: family name of transcription factor.

  The analysis results will be presented in a bar graph, displaying the top thirty transcription factor families annotated with the largest number of transcripts. The results are as follows:

Figure 4.4.6 TF analysis

Note: The The x-axis represents different transcription factor families; The y-axis represents the number of TFs

4.4.7 LncRNA Prediction and Analysis

  Most RNA prediction methods are based on alignment, either pairwise alignment to search for protein evidence or multiple alignments to calculate phylogenetic conservation score (such as CPC, PhyloCSF and RNACode).

  The prediction of long non-coding RNAs (LncRNAs) is based on the high conservation, which is represented by identified transcripts. These transcripts include protein coding RNA and short housekeeping/regulatory RNAs such as snRNAs, snoRNA and tRNA. However, most lncRNAs are less conserved and tend to be lineage specific which greatly limit the discriminative capability of alignment-based methods. Additionally, a large number of protein-coding genes may have alternatively spliced isoforms transcribed from alternative promoters. Homology search approaches lack sensitivity to distinguish lncRNAs from protein-coding transcript.

Figure 4.4.7 The workflow for predicting LncRNAs

4.4.7.1 The Prediction of LncRNA

  The number of noncoding transcripts predicted by various software tools is depicted in a Venn diagram, providing an intuitive display of both the shared and unique noncoding transcripts predicted by each method. To ensure the accuracy of the predictions, the shared results among these software tools are ultimately selected for subsequent analysis. The overlapping predicted results are depicted in the Venn diagram below:

Figure 4.4.7.1 Venn diagram of LncRNA prediction results

4.4.7.2 LncRNA Classification

  After filtering, lncRNAs are classified based on their genomic positions, and the distribution is shown to display the proportions.

Figure 4.4.7.2 LncRNA classification result diagram

4.4.7.3 Comparison Analysis of the Lengths of LncRNA and mRNA

  Generate length distribution density plots for mRNA and the predicted novel lncRNA. The results are displayed in the figure.

Figure 4.4.7.3 Length distribution of LncRNA and mRNA

4.4.7.4 Comparison of Exon Numbers of LncRNA and mRNA

  Compare the number of exons in mRNA and predicted novel lncRNA. The results are illustratd in the figure.

Figure 4.4.7.4 Results of the number of exons included in LncRNA and mRNA

4.4.8 Fusion Transcript

  A fusion gene is defined as a gene that is made by the joining parts of two genes, resulting in their transcription and translation as a single unit. Fusion transcripts are commonly observed in cancer cells, and the detection of fusion transcripts is part of routine diagnostic test of certain cancer types. The analysis principle for detecting fusion genes is as follows:

Figure 4.4.8 Gene structure analysis results

Note: The circos plot is arranged from the outermost to the innermost as follows:

1. Chromosome sequence.
2. Alternative splicing sites (stacked bar graph, with different colors representing different types of alternative splicing: light blue for RI, green for A3, yellow for A5, purple for SE, red for MX, brown for AF, and dark blue for AL).
3. APA site.
4. Distribution of new transcripts, where the density increases as the color tends towards red.
5. Distribution of new genes, where the density increases as the color tends towards red.
6. Density distribution of lncRNAs.
7. Fusion genes, indicated by purple lines (shared) and yellow lines (distinct) representing gene fusions on different chromosomes.

4.4 Quantitative Analysis


4.4.1 Gene Expression Level Analysis

  The quantification of gene expression at the gene level typically uses a simple read-counting model. The number of reads is directly proportional to the true expression level of the gene, as well as positively correlated with the gene's length and sequencing depth. Gene-level quantification analysis is carried out using IsoQuant, which converts read counts into TPM values. IsoQuant is a tool for the genome-based analysis of long RNA reads, which further performs quantification of annotated genes and isoforms. The results provide gene expression profiles for each sample at different expression levels.

Table 4.4.1 Gene quantification statistics

  • TPM_interval: the range of TPM values.
  • Sample*: the range of TPM values corresponding to different samples, along with the number of reads and their respective percentages.

4.4.2 Transcript Expression Level Analysis

  Upon conducting quantitative analysis of the transcriptome sequences aligned to the reference genome, the TPM for transcripts in each sample is as follows:

Table 4.4.2 Transcript quantification statistics

  • FPKM_interval: transcript ID.
  • Sample*: the TPM values corresponding to different samples.

Figure 4.4.2 The gene count statistics across various levels of expression

4.4.2.1 Comparative Analysis of Gene Expression Level

  Gene expression levels among different samples (experimental conditions) are compared using density distributions and box plots of FPKM values for all genes. For replicated samples under the same experimental condition, the final FPKM is calculated as the mean of all replicates' data.

Figure 4.4.2.1.1 The comparative gene expression at sample level

Note: The distribution of FPKM density is presented, with the x-axis denoting log10(FPKM) and the y-axis representing gene density.

Figure 4.4.2.1.2 Comparative analysis of gene expression levels under distinct experimental conditions

Note: The graph depicts box plots of FPKM values, with the x-axis representing the sample names and the y-axis indicating log10(FPKM+1). Each box plot corresponds to five statistical measures (from top to bottom: maximum value, upper quartile, median, lower quartile, and minimum value).

4.4.2.2 Correlation Analysis of Gene Expression


  The correlation of gene expression levels in samples is a crucial indicator of the reliability of experiments and the rationality of sample selection. Pearson correlation coefficient (ratio between the covariance of two variables and the product of their standard deviations) closer to 1 indicates a more similar expression pattern among samples. The correlation coefficients of samples within and between groups are calculated based on the FPKM values of all genes in each sample, and are visualized as a heatmap, providing an intuitive display of inter-group sample differences and intra-group sample replicates. Higher correlation coefficients between samples indicate a closer resemblance in their gene expression patterns. The sample correlation heatmap is shown in the figure below.

Figure 4.4.2.2 Analysis of gene expression correlation

Note: A heatmap depicting the Pearson correlation coefficients between samples, where values approaching 1 indicate stronger correlations.

4.5.3 Transcript Expression Level Analysis

  Upon conducting quantitative analysis of the transcriptome sequences aligned to the reference genome, the FPKM for transcripts in each sample is as follows:

Table 4.4.3 Transcript quantification statistics

  • transcript_id: transcript ID.
  • Sample*: the FPKM values corresponding to different samples.

4.5 Differential Gene Expression Analysis


  The input data for differential gene expression analysis is the read count data obtained from gene expression level analysis. The analysis is mainly divided into three parts:

1.Normize the read count data;
2.Calculate the probability of hypothesis testing (p-value) based on the model;
3. Perform multiple hypothesis testing correction to obtain the False Discovery Rate (FDR) value.

Type Software method Normalization P-value FDR Calculation Method Differentially expressed gene Selection Criteria
Biological replicates DESeq2(Anders et al, 2014) DESeq Negative Binomial Distribution BH |log2(FoldChange)| > 0 & padj < 0.05
No biological replicates DEGseq(Wang et al, 2010) TMM Poisson Distribution BH |log2(FoldChange)| > 1 & padj < 0.005

4.5.1 Result of Differential Gene Expression Analysis

Table 4.5.1 List of differentially expressed genes

  • gene_id: gene ID.
  • Sample1_readcount: the corrected read count values for sample 1.
  • Sample2_readcount: the corrected read count values for sample 2.
  • Log2(FoldChange): the ratio of gene expression level between the treatment group and the control group is processed by the shrinkage model of the differential analysis software, and finally the logarithm is taken with 2 as the base.
  • p_value: p-value in WaldTest.
  • q_value: the corrected pvalue of multiple hypothesis test.

4.5.1.1 Volcano Plot

  The volcano map can visually display the distribution of differentially expressed genes in both treatment and control samples, as shown in the following figure.

Figure 4.5.1.1 Volcano plot of differentially expressed genes

Note: The x-axis represents the log2FoldChange values, reflecting the fold change in gene expression among different samples. The y-axis represents significance level (-log10padj), reflecting the statistical significance of the difference in gene expression levels.

4.5.1.2 Venn Diagram

  The Venn diagram is utilized to visually demonstrate the intersection of differentially expressed genes across different comparative combinations. It facilitates the identification of differentially expressed genes that are either unique to specific comparative combinations or shared among them. In the event of a single differential comparison, the default depiction in the Venn diagram illustrates the co-expressed genes between the treatment and control groups. The total number of differentially expressed genes for a specific comparative combination is represented by the sum of all numbers within the circles in the Venn diagram. The overlapping areas indicate the number of differentially expressed genes shared between the combinations, as depicted in the figure below.

Figure 4.5.1.2 Venn diagram illustrating differentially expressed genes

Note: The distinct colors denote various comparative combinations.

4.6.1.3 Cluster Analysis


  By categorizing genes with comparable expression patterns, it becomes feasible to deduce the roles of unannotated genes or uncover novel functions of known genes. Hierarchical clustering analysis is conducted utilizing the FPKM values of differentially expressed genes under different experimental conditions. Unique colored clusters signify diverse grouping information, where genes within the same cluster demonstrate analogous expression patterns and conceivably share similar functionalities or engage in the same biological processes.

  In a heatmap, genes or samples with similar expression patterns are clustered together. The color in each square does not reflect gene expression values, but rather the normalized values but rather the normalized gene expression values (log10(FPKM+1)). Therefore, the colors in the heatmap can only be used for horizontal comparisons (the expression of the same gene in different samples) and not for vertical comparisons (the expression of different genes in the same sample). The final report displays the intra-sample clustering, as shown in the specific figure below (this figure is generated when the number of differentially expressed genes is less than 50,000,this figure is not generated when the differentially expressed genes exceed 50,000.).

Figure 4.5.1.3.1 Hierarchical clustering of overall FPKM

Note: The log10(FPKM+1) values are normalized and scaled, followed by clustering. Genes with high expression level are denoted in red and those with low expression level in blue. The color spectrum from red to blue signifies the descending order of log10(FPKM+1) values.

  In addition to conducting hierarchical clustering analysis of differentially expressed genes expression at the FPKM level, the relative expression levels of differentially expressed genes (log2(ratios)) are clustered using three methods: H-cluster, K-means, and SOM. Each clustering algorithm delineated the differentially expressed genes into distinct clusters, within which genes exhibited similar trends in expression levels under different treatment conditions. The clustering results for these clusters are as follows: (This figure is generated for differentially expressed gene numbers less than 25000).

Figure 4.5.1.3.2 The clustering results

Note: The line graph of log2(ratios) depicts the relative expression levels of genes within each cluster under different experimental conditions (with the expression level of the first sample as the reference, indicated by the red line). The grey lines in each subgraph represent the genes within a cluster, while the blue line represents the average relative expression levels of all genes within that cluster under different experimental conditions. The x-axis denotes the experimental conditions, and the y-axis represents the relative expression levels.

4.5.2 Differentially Expressed Transcripts Analysis


  The analysis of differentially expressed transcripts is conducted using the Cuffdiff software. From a statistical perspective, the transcripts from different samples are subjected to differential analysis. The results of the differential transcript analysis are presented in the table below:

Table 4.5.2 List of differentially expressed transcripts

  • transcript_id: transcript ID.
  • gene_id: gene ID.
  • locus: the location of the transcript on the chromosome.
  • sample*: sample name.
  • value*: the mean FPKM values of the samples.
  • Log2(FoldChange): the logarithm to the base 2 of the fold change values is calculated.
  • p_value: the p-value of the significance test.
  • q_value: the corrected p-values from the multiple hypothesis testing are used to determine the significance of differential transcript expression, with a threshold set at q-value < 0.05. A smaller q-value indicates a more pronounced difference in transcript expression.
  • Significant: the level of significance.

4.6 GO enrichment analysis of deferentially expressed gene


  Perform GO enrichment analysis using the GOseq software, which is based on the Wallenius non-central hypergeometric distribution, as opposed to the standard hypergeometric distribution. This distribution is characterized by the probability of sampling an individual from a certain category being different from the probability of sampling an individual from outside that category. This disparity in probabilities is estimated based on the preference for gene length, allowing for a more accurate calculation of the enrichment probability of GO terms by differentially expressed genes.

4.6.1 List of Differentially Expressed Genes Enriched in GO

Table 4.6.1 Enriched GO terms of differentially expressed genes


  • GO_accession: unique identification id of Gene Ontology database.

  • Description: the functional description of GO.
  • Term_type: the type of GO term.
  • Over_represented_pValue: statistically significant level of enrichment analysis
  • Corrected_pValue: the adjusted P-value, typically denoted as Corrected_p-Value, rected_p-Value less than 0.05 generally considered statistically significant.
  • DEG_list: the number of all differentialiy expressed genes enriched in GO.
  • Bg_item: the number of background genes associated with this GO.
  • Bg_list: the number of all the background genes with GO annotation.
  • Gene_names: the identification of differentially expressed genes associated with this GO.

4.6.2 Bar Graph

  The bar graph below reflects the distribution of the number of differentially expressed genes in enriched GO terms related to biological processes, cellular components, and molecular functions.

Figure 4.6.1 Enrichment bar graph of GO

Note: The x-axis represents different terms under the biological process, cellular component, and molecular function of GO. The y-axis represents the number of differentially expressed genes enriched in different terms.

4.6.3 DAG graph

  The Directed Acyclic Graph (DAG) serves as a graphical representation of the results of GO enrichment analysis. In the graph, each GO term is a node in the DAG and the branches represent the inclusive relationships, and the range of functions decreases in size from top to bottom. Typically, the top 10 results of the GO enrichment analysis are selected as the primary nodes of the DAG, and interconnected GO terms are displayed through their inclusive relationships. The intensity of color shading reflects the degree of enrichment, and distinct DAGs are constructed for the biological process, molecular function, and cellular component categories.

Figure 4.6.2 differentially expressed gene enrichment in GO and DAG analysis

Note: The x-axis represents the log2FoldChange values, and the y-axis represents significance level (-log10padj). The dashed line is the threshold for differentially expressed gene. Differentially expressed genes are depicted as red dots (up-regulated) and bule dots (down-regulated), while blue dots indicate genes which are not differentlly expressed. The x-axis represents the fold change in gene expression among different samples, and the y-axis represents the statistical significance of the difference in gene expression levels.

4.7 KEGG enrichment analysis of deferentially expressed gene


  Within organisms, different genes work together to perform their biological functions. Significant pathway enrichment identifies the key biochemical metabolism and signaling pathways involving differentially expressed genes. The pathway significance enrichment analysis uses pathways from the KEGG database and applies the hypergeometric test to identify pathways significantly enriched in differentially expressed genes compared to the genomic background. The calculation formula is as follows:

\[p=1-\sum_{i=0}^{m-1}\frac{C_{M}^{i} \ast C_{N-M}^{n-i}}{C_{N}^{n}}\]

An FDR ≤ 0.05 indicates significant enrichment of differentially expressed genes in a pathway. We use KOBAS (3.0) for pathway enrichment analysis.

  • N: The number of genes with pathway annotations among all genes.
  • n: The number of differentially expressed genes among N.
  • M: The number of genes annotated for a specific pathway among all genes.
  • m: The number of differentially expressed genes annotated for a specific pathway.

4.7.1 Differentially Expressed Gene KEGG Enrichment List


Table 4.7.1 Differentially expressed gene KEGG enrichment list

  • Term: the descriptive information of the KEGG pathway.
  • Database: KEGG database
  • ID: the unique identifiers for pathways in the KEGG database.
  • Input number: the number of differentially expressed genes associated with this KEGG pathway.
  • Background number: the total number of background genes associated with this KEGG pathway.
  • P-Value: The P-value of the enrichment analysis.
  • Corrected P-Value: the significance level of the p-value from the hypothesis test.
  • Input: the gene name of differentially expressed genes associated with the aforementioned KEGG pathway.
  • Hyperlink: the enriched KEGG pathway can be accessed through the official KEGG website.

4.7.2 Scatter Plot

  Scatter plot is a graphical representation of the KEGG enrichment analysis results. In this depiction, the degree of KEGG enrichment is measured through Rich factor, Qvalue, and the number of genes enriched in this pathway. The Rich factor refers to the ratio of the number of enriched differentially expressed genes (sample number) to the number of annotated genes (background number) in the pathway. A higher Rich factor value indicates a greater degree of enrichment. Qvalue, on the other hand, is the P-value after multiple hypothesis testing correction, with a range of [0:1]. A value closer to zero signifies a more significant enrichment. The top 20 most significantly enriched pathway items are displayed in the figure below. If the enriched pathway items are less than 20, all will be shown.

Figure 4.7.1 Scatter plot depicting enriched KEGG pathways for differentially expressed genes

Note: The y-axis represents the names of the pathways, while the x-axis represents the Rich factor. The size of the points indicates the number of differentially expressed genes in each pathway, and the color of the points corresponds to different Qvalue ranges.

4.7.3 Enrichment Analysis of KEGG Pathways

  To facilitate viewing the distribution of differentially expressed genes in the pathway map, the differentially expressed genes have been annotated in the pathway map. The viewing method is as follows: upon obtaining the complete analysis results, open the corresponding HTML file in the 'results' directory, and click on the different pathway names to display the corresponding pathway maps. In these maps, KO nodes containing upregulated genes are highlighted in red, while those containing downregulated genes are highlighted in green. Nodes containing both upregulated and downregulated genes are highlighted in yellow. Hovering the mouse over the marked KO nodes will display a box detailing the differentially expressed genes, with the same color-coding, and the numerical value in parentheses representing the log2(Foldchange).

  These steps can be implemented offline. However, should an internet connection be available, clicking on each node will provide access to the specific information page for each KO in the official KEGG database. For the enriched KEGG pathway report, please click here: KEGG pathway result display.

Figure 4.7.2 The significantly enriched KEGG pathways in the metabolic pathway network

5 Appendix


5.1 Methods

  To facilitate user article writing, we have prepared a glossary and methods involved in data analysis.

5.2 Result File Decompression Method

Compression.format user.type method
Compressed files of the form file.tar.gz Unix/Linux/Mac users Use tar -zxvf file.tar.gz command
Windows users Use decompression software such as WinRAR, 7-Zip, etc.
A compressed file of the form file.gz Unix/Linux/Mac users Use the gzip -d file.gz command
Windows users Use decompression software such as WinRAR, 7-Zip, etc.
Compressed file in the form of file.zip Unix/Linux/Mac users Use the unzip file.zip command
Windows users Use decompression software such as WinRAR, 7-Zip, etc.

5.3 Analysis Software Summary

Analysis.content Software and Version Parameter Remark Software Link
Raw data processing and analysis SMRT-Link V13.0 Default parameters FLNC Identification and Transcript Deduplication https://www.pacb.com/support/software-downloads
Database Function Notes Diamond blastx v0.8.36 --more-sensitive -k 10;-e 1e-5;-f 6;-p 4 NR、KOG/COG、Swiss-prot、KEGG https://github.com/bbuchfink/diamond
ncbi-blast-2.7.1+ blastn V2.7.1 -outfmt 6;-evalue 1e-5;-max_target_seqs 10;-num_threads 4 NT Annotation ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
Hmmscan v3.1b2 --acc;--domtblout Pfam Annotation http://hmmer.org/download.html
Gene Structure Analysis pbmm2 v1.13.1(with minimap2 v2.26) --preset ISOSEQ;--sort Reference Genome Alignment https://github.com/PacificBiosciences/pbmm2.git
pigeon v1.2.0 Default parameters Transcript Structure Identification https://isoseq.how/classification/pigeon.html
SUPPA V2.3 Default parameters Alternative Splicing Identification https://bitbucket.org/regulatorygenomicsupf/suppa
Transcription Factor Identification Plant:iTAK iTAK:1.7a -f 3F Transcription Factor Identification https://github.com/kentnf/iTAK/
Animal:AnimalTFDB AnimalTFDB:2.0 Default parameters Transcription Factor Identification http://bioinfo.life.hust.edu.cn/AnimalTFDB/
lncRNA Analysis CPC v0.9 Default parameters Coding Potential Prediction http://cpc.cbi.pku.edu.cn/
CNCI V2 Default parameters Coding Potential Prediction https://github.com/www-bioinfo-org/CNCI
PLEK v1.2 Default parameters Coding Potential Prediction https://sourceforge.net/projects/plek/
PfamScan V1.6 Default parameters Protein Domain Detection https://www.ebi.ac.uk/seqdb/confluence/display/THD/PfamScan
Gene Fusion Identification GMAP v2017-06-20 --max-intronlength-ends 50000;-f 4;-z sense_force;-n 0 Gene Fusion Identification http://research-pub.gene.com/gmap/
Alignment Quantification IsoQuant V3.3 Default parameters Gene-Level Quantification https://github.com/ablab/IsoQuant
Differential Expression Analysis DEseq v1.10.1 qvalue < 0.05 With Biological Replicates http://www.bioconductor.org/packages/release/bioc/html/DESeq.html
DEGseq v1.12.0 |log2(FoldChange)| > 1 & qvalue < 0.005 Without Biological Replicates http://www.bioconductor.org/packages/release/bioc/html/DEGseq.html
GO Enrichment GOSeq V1.10.0 Corrected P-Value<0.05 GO Enrichment Analysis http://www.bioconductor.org/packages/release/bioc/html/goseq.html
KEGG Enrichment KOBAS v3.0 Corrected P-Value<0.05 KEGG Enrichment Analysis http://kobas.cbi.pku.edu.cn/download.php

5.4 References


1.H L. New Strategies to Improve Minimap2 Alignment Accuracy.[J]. Bioinformatics, 2021, 21(37): 4572.

2.FRANCISCO J. PARDO-PALACIOS L K Angeles Arzalluz-Luque. SQANTI3: Curation of Long-Read Transcriptomes for Accurate Identification of Known and Novel Isoforms[J]. bioRxiv, 2023.

3.P A G, A P, L T J. Leveraging Transcript Quantification for Fast Computation of Alternative Splicing Profiles.[J]. Rna, 2015, 21(9): 1521.

4.ABDEL-GHANY S E, HAMILTON M, JACOBI J L, et al. A Survey of the Sorghum Transcriptome Using Single-Molecule Long Reads[J]. Nature Communications, 2016, 7: 11706.

5.ZHANG H M, TENG L, LIU C J, et al. AnimalTFDB 2.0: A Resource for Expression, Prediction and Functional Study of Animal Transcription Factors[J]. Nucleic Acids Research, 2015(D1): D76.

6.ZHENG Y, JIAO C, SUN H, et al. ITAK: A Program for Genome-Wide Prediction and Classification of Plant Transcription Factors, Transcriptional Regulators, and Protein Kinases[J]. Molecular Plant, 2016, 9(012): 1667–1670.

7.LIANG S, HAITAO L, DECHAO B, et al. Utilizing Sequence Intrinsic Composition to Classify Protein-Coding and Long Non-Coding Transcripts[J]. Nuclc Acids Research, 2013(17): e166–e166.

8.AIMIN L, JUNYING Z, ZHONGYIN Z. PLEK: A Tool for Predicting Long Non-Coding Rnas and Messenger Rnas Based on an Improved K-Mer Scheme[J]. Bmc Bioinformatics, 2014, 15(1): 311.

9.LEI K, YONG Z, YE Z Q, et al. CPC: Assess the Protein-Coding Potential of Transcripts Using Sequence Features and Support Vector Machine[J]. Nucleic Acids Research, 2007, 35(Web Server issue): W345.

10.FINN R D, PENELOPE C, EBERHARDT R Y, et al. The Pfam Protein Families Database: Towards a More Sustainable Future[J]. Nucleic Acids Research, 2016(D1): D279–D285.

11.KELLIS M L. GENCODE: The Reference Human Genome Annotation for the Encode Project[J]. Genome Research, 2012, 22(9): 1760–1774.

12.GAEL P. ALAMANCOS J L T Amadís Pagès, EYRAS E. Leveraging Transcript Quantification for Fast Computation of Alternative Splicing Profiles[J]. RNA, 2015, 21(9): 1521–1531.

13.PRJIBELSKI M A.D. Accurate Isoform Discovery with Isoquant Using Long Reads. Nat Biotechnol 41, 915–918 (2023). Https://Doi.org/10.1038/S41587-022-01565-Y[J]. Nat Biotechnol, 2023, 21(9): 1521.